The phylogenetic Kantorovich-Rubinstein metric for environmental sequence samples.
Abstract
It is now common to survey microbial communities by sequencing nucleic acid material extracted in bulk from a given environment. Comparative methods are needed that indicate the extent to which two communities differ given data sets of this type. UniFrac, which gives a somewhat ad hoc phylogenetics-based distance between two communities, is one of the most commonly used tools for these analyses. We provide a foundation for such methods by establishing that, if we equate a metagenomic sample with its empirical distribution on a reference phylogenetic tree, then the weighted UniFrac distance between two samples is just the classical Kantorovich-Rubinstein, or earth mover's, distance between the corresponding empirical distributions. We demonstrate that this Kantorovich-Rubinstein distance and extensions incorporating uncertainty in the sample locations can be written as a readily computable integral over the tree, we develop $L^p$ Zolotarev-type generalizations of the metric, and we show how the p-value of the resulting natural permutation test of the null hypothesis 'no difference between two communities' can be approximated by using a Gaussian process functional. We relate the $L^2$ case to an analysis-of-variance type of decomposition, finding that the distribution of its associated Gaussian functional is that of a computable linear combination of independent $\chi_1^2$ random variables.
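The "readily computable integral over the tree" can be illustrated with a minimal sketch. It assumes, for simplicity, that each sample places its mass only at nodes of a rooted reference tree, so the integral reduces to a sum over edges of edge length times the absolute difference in mass lying below that edge. The function name `kr_distance`, the `parent`/`length` dictionaries, and the toy tree are hypothetical and chosen only for illustration. The `p` argument is one plausible reading of the $L^p$ Zolotarev-type generalization (the p-th root of the integrated p-th power), not necessarily the paper's exact normalization.

```python
from collections import defaultdict


def kr_distance(parent, length, p_mass, q_mass, p=1.0):
    """Sketch of a tree Kantorovich-Rubinstein distance (assumed form).

    parent : dict mapping each non-root node to its parent node
    length : dict mapping each non-root node to the length of the edge above it
    p_mass, q_mass : dicts mapping nodes to probability mass (each sums to 1)
    p : exponent; p=1 gives the earth mover's (weighted UniFrac-style) distance
    """
    # Signed mass difference at each node (P minus Q).
    diff = defaultdict(float)
    for node, mass in p_mass.items():
        diff[node] += mass
    for node, mass in q_mass.items():
        diff[node] -= mass

    def depth(node):
        # Number of edges between this node and the root.
        d = 0
        while node in parent:
            node = parent[node]
            d += 1
        return d

    # Visit the deepest nodes first so every subtree total is complete
    # before it is pushed up to the parent.
    total = 0.0
    for node in sorted(parent, key=depth, reverse=True):
        total += length[node] * abs(diff[node]) ** p
        diff[parent[node]] += diff[node]
    return total ** (1.0 / p)


# Toy tree (hypothetical): root R has children A and I; I has children B and C.
parent = {"A": "R", "I": "R", "B": "I", "C": "I"}
length = {"A": 1.0, "I": 1.0, "B": 2.0, "C": 3.0}
sample_1 = {"A": 1.0}            # all reads placed at A
sample_2 = {"B": 0.5, "C": 0.5}  # reads split evenly between B and C

print(kr_distance(parent, length, sample_1, sample_2))         # p = 1
print(kr_distance(parent, length, sample_1, sample_2, p=2.0))  # p = 2
```

For p = 1 the toy result can be checked by hand as a transport cost: moving half the mass from A to B costs 0.5 × 4 and half from A to C costs 0.5 × 5, for a total of 4.5.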
Related articles
Extreme points of a ball about a measure with finite support
We show that, for the space of Borel probability measures on a Borel subset of a Polish metric space, the extreme points of the Prokhorov, Monge-Wasserstein and Kantorovich metric balls about a measure whose support has at most n points, consist of measures whose supports have at most n+2 points. Moreover, we use the Strassen and Kantorovich-Rubinstein duality theorems to develop representation...
EMDUnifrac: Exact Linear Time Computation of the Unifrac Metric and Identification of Differentially Abundant Organisms
Both the weighted and unweighted Unifrac distances have been very successfully employed to assess if two communities differ, but do not give any information about how two communities differ. We take advantage of recent observations that the Unifrac metric is equivalent to the so-called earth mover’s distance (also known as the Kantorovich-Rubinstein metric) to develop an algorithm that not only...
Optimal Couplings of Kantorovich-Rubinstein-Wasserstein Lp-distance
The research is supported by Zhejiang Provincial Education Department Research Projects (Y201016421). Abstract: We achieve that the optimal solutions according to Kantorovich-Rubinstein-Wasserstein Lp-distance (p > 2) (abbreviation: KRW Lp-distance) in a bounded region of Euclidean plane satisfy a partial differential equation. We can also obtain the similar results about Monge-Kantorovich proble...
Imaging with Kantorovich-Rubinstein Discrepancy
We propose the use of the Kantorovich-Rubinstein norm from optimal transport in imaging problems. In particular, we discuss a variational regularisation model endowed with a Kantorovich-Rubinstein discrepancy term and total variation regularization in the context of image denoising and cartoon-texture decomposition. We point out connections of this approach to several other recently proposed me...
Simulation Hemi-metrics between Infinite-State Stochastic Games
We investigate simulation hemi-metrics between certain forms of turn-based 2½-player games played on infinite topological spaces. They have the desirable property of bounding the difference in payoffs obtained by starting from one state or another. All constructions are described as the special case of a unique one, which we call the Hutchinson hemi-metric on various spaces of continuous pre...
Journal title:
- Journal of the Royal Statistical Society. Series B, Statistical methodology
Volume 74, Issue 3
Pages -
Publication date 2012